Background

BITHub has two aims:

  1. Allowing the expression of a gene (or genes) of interest to be compared across multiple datasets

  2. Allowing expression to be compared against metadata variables within a given dataset

For the second aim, it is vital that BITHub contains the relevant metadata annotations. BITHub aims to provide three types of metadata annotations for each dataset in the web browser: sample characteristics, sequencing metrics and phenotype.

To ensure the metadata is displayed in a user-friendly manner, highly correlated metadata annotations will be removed and only a subset will be used for the site. We will also perform variancePartition analysis on the subsetted list.

Set-up

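source("functions.R")
library(pander)
library(magrittr)  # %>% and %<>%, in case functions.R does not attach it
library(corrplot)  # corrplot(), in case functions.R does not attach it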

Metadata correlations - Bulk data

Datasets

BrainSeq

bseq = read.csv("/home/neuro/Documents/BrainData/Bulk/Brainseq/Formatted/BrainSeq-metadata.csv", header = TRUE, row.names = 1)
bseq %<>% 
  dplyr::select(-c(FQCbasicStats, perSeqQual, SeqLengthDist,KmerContent))
M = cor(data.matrix(bseq), use = "complete.obs")
corrplot(M, order = 'AOE')

Correlation plot of metadata annotations from BrainSeq phase II. The metadata annotations are clustered based on correlation

Before running cor(), the FQCbasicStats, perSeqQual, SeqLengthDist and KmerContent columns were removed because each contained a single repeated value. A column with zero variance has an undefined correlation, so cor() returns NA for it.
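The effect of a constant column can be reproduced on a toy data frame. The sketch below uses made-up values (the "PASS" entries and the numbers are assumptions, not taken from the BrainSeq file) and shows one possible way to flag such columns automatically rather than naming them by hand.

```r
# Made-up metadata: FQCbasicStats holds one repeated value, as described above.
md <- data.frame(
  RIN           = c(7.1, 8.3, 6.5, 9.0),
  mitoRate      = c(0.02, 0.05, 0.03, 0.01),
  FQCbasicStats = c("PASS", "PASS", "PASS", "PASS")
)

# With the constant column left in, cor() warns and returns NA for it.
M.all <- suppressWarnings(cor(data.matrix(md), use = "complete.obs"))
anyNA(M.all)

# Flag columns with at most one distinct non-missing value, then drop them.
constant <- vapply(md, function(x) length(unique(x[!is.na(x)])) <= 1, logical(1))
names(md)[constant]

M <- cor(data.matrix(md[, !constant, drop = FALSE]), use = "complete.obs")
```

Detecting constant columns this way avoids hard-coding their names, at the cost of silently dropping a column that might have been expected to vary.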

The BrainSeq metadata annotations contain duplicate information in many columns (e.g. SampleID, SAMPLEID), most likely introduced by the BITHub pre-processing pipeline. In addition, some columns hold very similar information and are therefore highly correlated. Several RNA-seq QC metrics are likewise redundant and will be removed before downstream analysis.

The final BrainSeq annotations will contain the following columns:

bseq.annot = read.csv("../annotations/BrainSeq-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>% 
  dplyr::select(-c(Include..Yes.No....Interest))
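The unusual column name Include..Yes.No....Interest is consistent with read.csv() sanitising a header that contains spaces and punctuation: with the default check.names = TRUE, headers are passed through make.names(), which replaces each invalid character with a dot. A minimal sketch, assuming a hypothetical original header of "Include (Yes/No) - Interest" and made-up rows (the real annotation file may differ):

```r
# Hypothetical annotation file; the header text and rows are illustrative only.
csv <- 'OriginalMetadataColumnName,BITColumnName,Type,"Include (Yes/No) - Interest"
RIN,RIN,Sequencing metrics,Yes
numReads,TotalNReads,Sequencing metrics,Yes
SAMPLEID,SampleID,Sample characteristics,No'

annot <- read.csv(text = csv)

# read.csv() has replaced every space, bracket, slash and hyphen with ".".
mangled <- names(annot)[4]
mangled

# Base-R equivalent of the filter/select step: keep the "Yes" rows,
# then drop the inclusion flag itself.
kept <- annot[annot[[mangled]] == "Yes", names(annot) != mangled]
kept
```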


bseq.annot %>%
  pander(justify = "lll",
    style = "rmarkdown",
    caption = "BrainSeq metadata annotations that will be used for BITHub ")

BrainSeq metadata annotations that will be used for BITHub

| OriginalMetadataColumnName | BITColumnName | Type |
| --- | --- | --- |
| X | SampleID | Sample charactertics |
| trimmed | trimmed | Sequencing metrics |
| numReads | TotalNReads | Sequencing metrics |
| numMapped | numMapped | Sequencing metrics |
| numUnmapped | numUnmapped | Sequencing metrics |
| overallMapRate | MappingRate | Sequencing metrics |
| concordMapRate | concordMapRate | Sequencing metrics |
| totalMapped | totalMapped | Sequencing metrics |
| mitoMapped | mitoMapped | Sequencing metrics |
| mitoRate | mito_Rate | Sequencing metrics |
| totalAssignedGene | totalAssignedGene | Sequencing metrics |
| rRNA_rate | rRNA_rate | Sequencing metrics |
| RNum | SampleID | Phenotype |
| Region | StructureAcronym | Sample charactertics |
| RIN | RIN | Sequencing metrics |
| Age | AgeNumeric | Phenotype |
| Sex | Sex | Phenotype |
| Race | Ethnicity | Phenotype |
| Dx | Diagnosis | Phenotype |
| Fetal_replicating | Dev.Replicating | Sample charactertics |
| Fetal_quiescent | Dev.Quiescent | Sample charactertics |
| OPC | Adult.OPC | Sample charactertics |
| Neurons | Adult.Neurons | Sample charactertics |
| Astrocytes | Adult.Astrocytes | Sample charactertics |
| Oligodendrocytes | Adult.Oligo | Sample charactertics |
| Microglia | Adult.Microglia | Sample charactertics |
| Endothelial | Adult.Endothelial | Sample charactertics |
| NA | AgeInterval | Phenotype |
| NA | Period | Phenotype |
| NA | Regions | Sample charactertics |

bseq %<>% dplyr::select(contains(bseq.annot$BITColumnName))

write.csv(bseq, file = "/home/neuro/Documents/BrainData/Bulk/Brainseq/Formatted/BrainSeq-metadata-subset.csv")
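One caveat about the column selection above: contains() matches substrings, and is case-insensitive by default, so a name such as RIN also selects a column called mRIN. If exact matching is what is intended, any_of() is one alternative. A small sketch on made-up columns (the data frame and the wanted vector are illustrative only):

```r
library(dplyr)

# Toy metadata with two columns whose names overlap.
md <- data.frame(RIN = 1:3, mRIN = 4:6, Sex = c("F", "M", "F"), Batch = 1:3)
wanted <- c("RIN", "Sex")

# Substring match: "RIN" is also found inside "mRIN".
by.substring <- names(dplyr::select(md, contains(wanted)))

# Exact match: only columns named exactly "RIN" or "Sex".
by.name <- names(dplyr::select(md, any_of(wanted)))

by.substring
by.name
```

Whether the looser match matters depends on the column names in each dataset, so it is worth comparing the selected names against the annotation table after subsetting.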

BrainSpan

bspan = read.csv("/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-metadata.csv", header = TRUE, row.names = 1)

M = bspan %>% dplyr::select(-c("Diagnosis")) %>% data.matrix() %>% cor(.,use = "complete.obs")

corrplot(M, order='AOE')

BrainSpan metadata annotations contain several duplicate and redundant columns that hold essentially the same information (e.g. column_num, Age.x, Braincode). The annotations were retrieved from multiple sources, so these duplicates most likely reflect the different IDs under which they were stored.

The following BrainSpan metadata annotations will be used for BITHub:

bspan.annot = read.csv("../annotations/BrainSpan-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>% 
  dplyr::select(-c(Include..Yes.No....Interest))

bspan.annot %>%
  pander(justify = "lll",
    style = "rmarkdown",
    caption = "BrainSpan metadata annotations that will be used for BITHub ")

BrainSpan metadata annotations that will be used for BITHub

| OriginalMetadataColumnName | BITColumnName | Type |
| --- | --- | --- |
| SampleID | SampleID | Sample characteristics |
| gender | Sex | Phenotype |
| structure_acronym | StructureAcronym | Sample characteristics |
| NA | Period | Phenotype |
| NA | AgeNumeric | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Diagnosis | Phenotype |
| NA | Regions | Sample characteristics |
| NA | mRIN | Sequencing metrics |
| Hemisphere | Hemisphere | Sample characteristics |
| RIN | RIN | Sequencing metrics |
| PMI | PMI | Sequencing metrics |
| pH | pH | Sequencing metrics |
| Ethnicity | Ethnicity | Phenotype |

bspan %<>% dplyr::select(contains(bspan.annot$BITColumnName))

write.csv(bspan, file = "/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-metadata-subset.csv")

GTEx

gtex = read.csv("/home/neuro/Documents/BrainData/Bulk/GTEx/Formatted/GTEx-metadata.csv", header = TRUE, row.names = 1)

M = cor(data.matrix(gtex))
## Warning in cor(data.matrix(gtex)): the standard deviation is zero
corrplot(M, method = 'number')

The GTEx metadata contains comprehensive annotations of sample, sequencing and phenotype attributes. For BITHub, however, we will remove metadata annotations that are strongly positively correlated with one another.
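With this many annotations, reading strongly correlated pairs off the plot is laborious. The sketch below lists them directly from a correlation matrix. It uses simulated columns, and the 0.8 cutoff is an arbitrary assumption. The source does not state a threshold, and the final GTEx selection is made by hand in the annotation file.

```r
set.seed(42)
n <- 40

# Simulated metadata: ReadsMapped is built to track TotalNReads closely.
md <- data.frame(
  TotalNReads = rnorm(n, mean = 5e7, sd = 5e6),
  RIN         = rnorm(n, mean = 7, sd = 1)
)
md$ReadsMapped <- md$TotalNReads * 0.95 + rnorm(n, sd = 1e4)

M <- cor(data.matrix(md), use = "pairwise.complete.obs")

# Look only above the diagonal so that each pair is reported once.
cutoff <- 0.8
idx <- which(upper.tri(M) & M > cutoff, arr.ind = TRUE)

pairs <- data.frame(
  var1 = rownames(M)[idx[, "row"]],
  var2 = colnames(M)[idx[, "col"]],
  r    = M[idx]
)
pairs
```

One member of each listed pair can then be marked "No" in the annotation file.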

The following metadata annotations will be used for GTEx:

gtex.annot = read.csv("../annotations/GTEx-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>% 
  dplyr::select(-c(Include..Yes.No....Interest))

gtex %<>% dplyr::select(contains(gtex.annot$BITColumnName))

gtex.annot %>%
  pander(justify = "lll",
    style = "rmarkdown",
    caption = "GTEx metadata annotations that will be used for BITHub ")

GTEx metadata annotations that will be used for BITHub

| OriginalMetadataColumnName | BITColumnName | Type |
| --- | --- | --- |
| SAMPID | SampleID | Sample charactertics |
| SMRIN | RIN | Sequencing metrics |
| SMTSISCH | PMI | Sequencing metrics |
| AGE | AgeInterval | Phenotype |
| SEX | Sex | Phenotype |
| SMATSSCR | AutolysisScore | Sample charactertics |
| SMNABTCH | IsolationBatchID | Sample charactertics |
| SMNABTCHT | TypeofBatch | Sample charactertics |
| SMNABTCHD | DateofBatch | Sample charactertics |
| SMGEBTCH | Genotype_or_Expression_Batch_ID | Sample charactertics |
| SMGEBTCHD | DateofGenotypeorExpressionBatch | Sample charactertics |
| SMGEBTCHT | TypeofGenotypeorExpressionBatch | Sample charactertics |
| SMCENTER | BSS_Collection_side_code | Sample charactertics |
| SMTSPAX | Time_spent_in_PAXgene_fixative | Sequencing metrics |
| SME2MPRT | End_2_mapping_rate | Sequencing metrics |
| SMCHMPRS | ChimericPairs | Sequencing metrics |
| SMNTRART | IntragenicRate | Sequencing metrics |
| SMNUMGPS | No_of_Gaps | Sequencing metrics |
| SMMAPRT | MappingRate_total | Sequencing metrics |
| SMEXNCRT | ExonicRate | Sequencing metrics |
| SM550NRM | BasedNormalised | Sequencing metrics |
| SMGNSDTC | GenesDetected | Sequencing metrics |
| SMUNMPRT | Rate_of_mapped_genes_unique | Sequencing metrics |
| SM350NRM | BaseNormilization | Sequencing metrics |
| SMESTLBS | LibrarySize | Sequencing metrics |
| SMMPPD | ReadsMapped | Sequencing metrics |
| SMNTERRT | IntergenicRate | Sequencing metrics |
| SMRRNANM | rRNA | Sequencing metrics |
| SMRDTTL | TotalNReads | Sequencing metrics |
| SMMNCV | Mean_Coeff_Variation | Sequencing metrics |
| SMTRSCPT | TranscriptsDetected | Sequencing metrics |
| SMMPPDPR | MappedPairs | Sequencing metrics |
| SMUNPDRD | UnpairedReads | Sequencing metrics |
| SMNTRNRT | IntronicRate | Sequencing metrics |
| SMMPUNRT | Mapped_unique_rate_of_total | Sequencing metrics |
| SMEXPEFF | ExpressionProfilingEfficiency | Sequencing metrics |
| SMMPPDUN | MappedUnique_no_dup_flags | Sequencing metrics |
| SME2MMRT | End_2_Mismatch_Rate | Sequencing metrics |
| SME2ANTI | End_2_Antisense | Sequencing metrics |
| SME2SNSE | End_Sense_2 | Sequencing metrics |
| SME1ANTI | End_1_Antisense | Sequencing metrics |
| SME1SNSE | End_1_Sense | Sequencing metrics |
| SME1PCTS | End_1_Sense_percentage | Sequencing metrics |
| SMRRNART | rRNA_rate | Sequencing metrics |
| SME1MPRT | End_1_Mapping_rate | Sequencing metrics |
| SMNUM5CD | Num_of_Reads_Covered_5prime | Sequencing metrics |
| SMDPMPRT | DuplicationRateMapped | Sequencing metrics |
| SME2PCTS | Percentage_IntragenicEnd_2_Reads | Sequencing metrics |
| DTHHRDY | HardyScale | Phenotype |

# Save the subsetted GTEx metadata.
write.csv(gtex, file = "/home/neuro/Documents/BrainData/Bulk/GTEx/Formatted/GTEx-metadata-subset.csv")

PsychEncode

pe = read.csv("/home/neuro/Documents/BrainData/Bulk/PsychEncode/Formatted/PsychEncode-metadata.csv", header = TRUE, row.names = 1)

M = pe %>% 
  dplyr::select(-c(ageBiopsy, smellTestScore,smoker,Structure, StructureAcronym, Regions, Capstone_4, Adult.In7)) %>%
  data.matrix() %>% cor(.,use ='pairwise.complete.obs' )
## Warning in cor(., use = "pairwise.complete.obs"): the standard deviation is
## zero
corrplot(M)
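The PsychEncode correlations use use = "pairwise.complete.obs", whereas BrainSeq and BrainSpan use "complete.obs". The two options can give different coefficients when missing values are scattered across columns: "complete.obs" keeps only samples with no missing value in any column, while "pairwise.complete.obs" keeps, for each pair of columns, every sample where both are present. A toy sketch with made-up values:

```r
# Made-up phenotype columns, each missing a value in a different sample.
md <- data.frame(
  height      = c(170, 165, NA, 180, 175, 160),
  weight      = c(70, 60, 80, NA, 75, 55),
  brainWeight = c(1300, 1250, 1400, 1450, NA, 1200)
)

# Listwise deletion: only samples complete in every column contribute.
M.complete <- cor(md, use = "complete.obs")
n.complete <- sum(complete.cases(md))

# Pairwise deletion: each coefficient uses all samples complete for that pair.
M.pairwise <- cor(md, use = "pairwise.complete.obs")
n.height.weight <- sum(complete.cases(md[, c("height", "weight")]))

c(complete = n.complete, height.weight = n.height.weight)
M.complete["height", "weight"]
M.pairwise["height", "weight"]
```

Pairwise deletion retains more data for sparse metadata, but each coefficient is then based on a different set of samples, which is worth keeping in mind when comparing cells of the plot.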

Information regarding Row_IDs, Row_Versions, Contributing Studies and Notes will be removed for BITHub.

The following metadata annotations will be retained for PsychEncode:

pe.annot = read.csv("../annotations/PsychEncode-metadata-annot.csv") %>%
  dplyr::filter(Include..Yes.No....Interest == "Yes") %>% 
  dplyr::select(-c(Include..Yes.No....Interest))

pe %<>% dplyr::select(contains(pe.annot$BITColumnName))

pe.annot %>%
  pander(justify = "llr",
    style = "rmarkdown",
    caption = "PsychEncode metadata annotations that will be used for BITHub ")

PsychEncode metadata annotations that will be used for BITHub

| OriginalColumnName | BITColumnName | Type |
| --- | --- | --- |
| individualID | SampleID | Sample charactertics |
| diagnosis | Diagnosis | Phenotype |
| sex | Sex | Phenotype |
| ethnicity | Ethnicity | Phenotype |
| ageDeath | AgeNumeric | Phenotype |
| Adult.Ex1 | Adult.Ex1 | Sample charactertics |
| Adult.Ex2 | Adult.Ex2 | Sample charactertics |
| Adult.Ex3 | Adult.Ex3 | Sample charactertics |
| Adult.Ex4 | Adult.Ex4 | Sample charactertics |
| Adult.Ex5 | Adult.Ex5 | Sample charactertics |
| Adult.Ex6 | Adult.Ex6 | Sample charactertics |
| Adult.Ex7 | Adult.Ex7 | Sample charactertics |
| Adult.Ex8 | Adult.Ex8 | Sample charactertics |
| Adult.In1 | Adult.In1 | Sample charactertics |
| Adult.In2 | Adult.In2 | Sample charactertics |
| Adult.In3 | Adult.In3 | Sample charactertics |
| Adult.In4 | Adult.In4 | Sample charactertics |
| Adult.In5 | Adult.In5 | Sample charactertics |
| Adult.In6 | Adult.In6 | Sample charactertics |
| Adult.In7 | Adult.In7 | Sample charactertics |
| Adult.In8 | Adult.In8 | Sample charactertics |
| Adult.Astrocytes | Adult.Astrocytes | Sample charactertics |
| Adult.Endothelial | Adult.Endothelial | Sample charactertics |
| Dev.Quiescent | Dev.Quiescent | Sample charactertics |
| Dev.Replicating | Dev.Replicating | Sample charactertics |
| Adult.Microglia | Adult.Microglia | Sample charactertics |
| Adult.OtherNeuron | Adult.OtherNeuron | Sample charactertics |
| Adult.OPC | Adult.OPC | Sample charactertics |
| Adult.Oligo | Adult.Oligo | Sample charactertics |
| structure_acronym | StructureAcronym | Sample charactertics |
| ageOnset | ageOnset | Phenotype |
| causeDeath | causeDeath | Phenotype |
| brainWeight | brainWeight | Phenotype |
| height | height | Phenotype |
| weight | weight | Phenotype |
| ageBiopsy | ageBiopsy | Sample charactertics |
| smellTestScore | smellTestScore | Sample charactertics |
| smoker | smoker | Sample charactertics |
| Capstone_4 | Capstone_4 | Sample charactertics |
| NA | Period | Phenotype |
| NA | AgeInterval | Phenotype |
| NA | Regions | Sample charactertics |

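# Save the subsetted PsychEncode metadata (output path follows the pattern used for the other datasets).
write.csv(pe, file = "/home/neuro/Documents/BrainData/Bulk/PsychEncode/Formatted/PsychEncode-metadata-subset.csv")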

Determining drivers of variation

BrainSeq

  • Filtering - how many genes are left and why is this necessary
  • Choosing of metadata variables
#bseq <- read.csv("datasets/FormattedData/FormattedData/BrainSeq/BrainSeq-exp.csv", row.names =2)[,-1]
#head(bseq)[1:10]
#bseq.exp <- bseq[apply(bseq >= 1, 1, sum) >= 0.1*ncol(bseq),]
#bseq.md <- read.csv("datasets/FormattedData/FormattedData/BrainSeq/BrainSeq-metadata-subset.csv", row.names = 1)
#head(bseq.md)
#form.bseq <- ~ AgeNumeric + (1|StructureAcronym) + (1|Sex) + RIN +  (1|Diagnosis) + mito_Rate + rRNA_rate + TotalNReads + MappingRate
#+ Adult.Oligo + Adult.Microglia + Adult.Endothelial
#varPar.bseq <- fitExtractVarPartModel(bseq.exp, form.bseq, bseq.md)
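The filter in the commented code keeps a gene only if its expression is at least 1 in at least 10% of samples. The sketch below applies the same rule to a simulated matrix (all values are made up) and reports how many genes remain, which addresses the first bullet above for toy data. rowSums() is used here as an equivalent of apply(..., 1, sum).

```r
set.seed(7)

# Simulated expression matrix: 20 genes (rows) by 10 samples (columns).
expr <- matrix(rpois(20 * 10, lambda = 0.3), nrow = 20,
               dimnames = list(paste0("gene", 1:20), paste0("sample", 1:10)))
expr[1:5, ] <- expr[1:5, ] + 10   # clearly expressed genes
expr[6:8, ] <- 0                  # genes never detected

# Keep genes with expression >= 1 in at least 10% of samples.
keep <- rowSums(expr >= 1) >= 0.1 * ncol(expr)
expr.filt <- expr[keep, , drop = FALSE]

c(before = nrow(expr), after = nrow(expr.filt))
```

Removing genes that are essentially undetected avoids fitting a model to rows with little or no variance to explain.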

BrainSpan

  • Filtering
  • Choosing of metadata variables
#bspan.exp = read.csv("/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-exp.csv", row.names = 1, check.names = FALSE) #%>%
#    column_to_rownames("EnsemblID")
#bspan.meta = read.csv("/home/neuro/Documents/BrainData/Bulk/BrainSpan/Formatted/BrainSpan-metadata.csv")


#bspan.exp <- bspan.exp[apply(bspan.exp >= 1, 1, sum) >= 0.1*ncol(bspan.exp),]
#form.bspan <- ~ AgeNumeric + (1|StructureAcronym) + (1|Sex) + (1|Period) + (1|Regions)

#varPar.bspan <- fitExtractVarPartModel(bspan.exp, form.bspan, bspan.meta)

GTEx

#gtex.exp <- read.csv("datasets/FormattedData/FormattedData/Gtex/GTEx-exp.csv", row.names = 1)
#gtex.md <- read.csv("datasets/FormattedData/FormattedData/Gtex/GTEx-metadata-subset.csv")
#colnames(gtex.md)
#gtex.exp <- gtex.exp[apply(gtex.exp >= 1, 1, sum) >= 0.1*ncol(gtex.exp),]

#gtex.form <- ~ TotalNReads + rRNA_rate + (1|TypeofBatch) + (1|DateofBatch) + (1|BSS_Collection_side_code) + (1|AgeInterval) + (1|Sex) + #(1|Regions) + IntergenicRate + RIN
#varPar.gtex <- fitExtractVarPartModel(gtex.exp, gtex.form, gtex.md)
#vp <- sortCols(varPar.gtex)
#plotVarPart(vp)
#write.csv(vp, "GTEx-varPart.csv")
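For intuition about what the commented calls estimate: fitExtractVarPartModel() from variancePartition fits a model per gene and reports the fraction of expression variance attributed to each metadata variable. The sketch below is not that method. It is a simplified, fixed-effects-only analogue in base R for a single simulated gene, using sequential sums of squares from anova(), and the data and effect sizes are invented.

```r
set.seed(3)
n <- 60

# Simulated metadata for 60 samples.
md <- data.frame(
  AgeNumeric = runif(n, 20, 80),
  Sex        = factor(sample(c("F", "M"), n, replace = TRUE)),
  RIN        = rnorm(n, mean = 7, sd = 1)
)

# Simulated expression of one gene, driven mostly by age.
y <- 0.1 * md$AgeNumeric + 0.5 * (md$Sex == "M") + rnorm(n)

fit <- lm(y ~ AgeNumeric + Sex + RIN, data = md)
tab <- anova(fit)

# Fraction of the total sum of squares assigned to each term and to residuals.
frac <- setNames(tab[["Sum Sq"]] / sum(tab[["Sum Sq"]]), rownames(tab))
round(frac, 2)
```

Unlike this sketch, sequential sums of squares depend on the order of terms, and categorical variables in the commented formulas are modelled as random effects, e.g. (1|Sex).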